In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
import matplotlib as mpl
import os
#sns.set()
%matplotlib inline
Nuestros datos: US air pollution¶
In [4]:
f = open('../data/usairpollution.txt', 'r')
print(f.read())
Air Pollution in US Cities
Description:
Air pollution data of 41 US cities.
Format:
A data frame with 41 observations on the following 7 variables.
‘SO2’ SO2 content of air in micrograms per cubic metre.
‘temp’ average annual temperature in Fahrenheit.
‘manu’ number of manufacturing enterprises employing 20 or more workers.
‘popul’ population size (1970 census); in thousands.
‘wind’ average annual wind speed in miles per hour.
‘precip’ average annual precipitation in inches.
‘predays’ average number of days with precipitation per year.
Details:
The annual mean concentration of sulphur dioxide, in micrograms per cubic metre, is a measure of the air pollution of the city.
The question of interest here is what aspects of climate and human ecology as measured by the other six variables in the data determine pollution?
Source:
R. R. Sokal and F. J. Rohlf (1981), _Biometry_, W. H. Freeman, San
Francisco (2nd edition).
In [5]:
USairpollution = pd.read_csv('../data/usairpollution.csv', index_col=0)
USairpollution
Out[5]:
| SO2 | temp | manu | popul | wind | precip | predays | |
|---|---|---|---|---|---|---|---|
| City | |||||||
| Albany | 46 | 47.6 | 44 | 116 | 8.8 | 33.36 | 135 |
| Albuquerque | 11 | 56.8 | 46 | 244 | 8.9 | 7.77 | 58 |
| Atlanta | 24 | 61.5 | 368 | 497 | 9.1 | 48.34 | 115 |
| Baltimore | 47 | 55.0 | 625 | 905 | 9.6 | 41.31 | 111 |
| Buffalo | 11 | 47.1 | 391 | 463 | 12.4 | 36.11 | 166 |
| Charleston | 31 | 55.2 | 35 | 71 | 6.5 | 40.75 | 148 |
| Chicago | 110 | 50.6 | 3344 | 3369 | 10.4 | 34.44 | 122 |
| Cincinnati | 23 | 54.0 | 462 | 453 | 7.1 | 39.04 | 132 |
| Cleveland | 65 | 49.7 | 1007 | 751 | 10.9 | 34.99 | 155 |
| Columbus | 26 | 51.5 | 266 | 540 | 8.6 | 37.01 | 134 |
| Dallas | 9 | 66.2 | 641 | 844 | 10.9 | 35.94 | 78 |
| Denver | 17 | 51.9 | 454 | 515 | 9.0 | 12.95 | 86 |
| Des Moines | 17 | 49.0 | 104 | 201 | 11.2 | 30.85 | 103 |
| Detroit | 35 | 49.9 | 1064 | 1513 | 10.1 | 30.96 | 129 |
| Hartford | 56 | 49.1 | 412 | 158 | 9.0 | 43.37 | 127 |
| Houston | 10 | 68.9 | 721 | 1233 | 10.8 | 48.19 | 103 |
| Indianapolis | 28 | 52.3 | 361 | 746 | 9.7 | 38.74 | 121 |
| Jacksonville | 14 | 68.4 | 136 | 529 | 8.8 | 54.47 | 116 |
| Kansas City | 14 | 54.5 | 381 | 507 | 10.0 | 37.00 | 99 |
| Little Rock | 13 | 61.0 | 91 | 132 | 8.2 | 48.52 | 100 |
| Louisville | 30 | 55.6 | 291 | 593 | 8.3 | 43.11 | 123 |
| Memphis | 10 | 61.6 | 337 | 624 | 9.2 | 49.10 | 105 |
| Miami | 10 | 75.5 | 207 | 335 | 9.0 | 59.80 | 128 |
| Milwaukee | 16 | 45.7 | 569 | 717 | 11.8 | 29.07 | 123 |
| Minneapolis | 29 | 43.5 | 699 | 744 | 10.6 | 25.94 | 137 |
| Nashville | 18 | 59.4 | 275 | 448 | 7.9 | 46.00 | 119 |
| New Orleans | 9 | 68.3 | 204 | 361 | 8.4 | 56.77 | 113 |
| Norfolk | 31 | 59.3 | 96 | 308 | 10.6 | 44.68 | 116 |
| Omaha | 14 | 51.5 | 181 | 347 | 10.9 | 30.18 | 98 |
| Philadelphia | 69 | 54.6 | 1692 | 1950 | 9.6 | 39.93 | 115 |
| Phoenix | 10 | 70.3 | 213 | 582 | 6.0 | 7.05 | 36 |
| Pittsburgh | 61 | 50.4 | 347 | 520 | 9.4 | 36.22 | 147 |
| Providence | 94 | 50.0 | 343 | 179 | 10.6 | 42.75 | 125 |
| Richmond | 26 | 57.8 | 197 | 299 | 7.6 | 42.59 | 115 |
| Salt Lake City | 28 | 51.0 | 137 | 176 | 8.7 | 15.17 | 89 |
| San Francisco | 12 | 56.7 | 453 | 716 | 8.7 | 20.66 | 67 |
| Seattle | 29 | 51.1 | 379 | 531 | 9.4 | 38.79 | 164 |
| St. Louis | 56 | 55.9 | 775 | 622 | 9.5 | 35.89 | 105 |
| Washington | 29 | 57.3 | 434 | 757 | 9.3 | 38.89 | 111 |
| Wichita | 8 | 56.6 | 125 | 277 | 12.7 | 30.58 | 82 |
| Wilmington | 36 | 54.0 | 80 | 80 | 9.0 | 40.25 | 114 |
Quitamos $\verb|SO2|$ por el momento, y la variable $\verb|temp|$ se transforma en valores negativos negtemp, para que valores altos de todas las variables, representen un lugar con medio ambiente "poco atractivo"
In [6]:
pollution = USairpollution.drop(['SO2'], axis = 1)
pollution['temp'] = pollution['temp']*-1
pollution.rename(columns={'temp':'negtemp'}, inplace=True)
pollution
Out[6]:
| negtemp | manu | popul | wind | precip | predays | |
|---|---|---|---|---|---|---|
| City | ||||||
| Albany | -47.6 | 44 | 116 | 8.8 | 33.36 | 135 |
| Albuquerque | -56.8 | 46 | 244 | 8.9 | 7.77 | 58 |
| Atlanta | -61.5 | 368 | 497 | 9.1 | 48.34 | 115 |
| Baltimore | -55.0 | 625 | 905 | 9.6 | 41.31 | 111 |
| Buffalo | -47.1 | 391 | 463 | 12.4 | 36.11 | 166 |
| Charleston | -55.2 | 35 | 71 | 6.5 | 40.75 | 148 |
| Chicago | -50.6 | 3344 | 3369 | 10.4 | 34.44 | 122 |
| Cincinnati | -54.0 | 462 | 453 | 7.1 | 39.04 | 132 |
| Cleveland | -49.7 | 1007 | 751 | 10.9 | 34.99 | 155 |
| Columbus | -51.5 | 266 | 540 | 8.6 | 37.01 | 134 |
| Dallas | -66.2 | 641 | 844 | 10.9 | 35.94 | 78 |
| Denver | -51.9 | 454 | 515 | 9.0 | 12.95 | 86 |
| Des Moines | -49.0 | 104 | 201 | 11.2 | 30.85 | 103 |
| Detroit | -49.9 | 1064 | 1513 | 10.1 | 30.96 | 129 |
| Hartford | -49.1 | 412 | 158 | 9.0 | 43.37 | 127 |
| Houston | -68.9 | 721 | 1233 | 10.8 | 48.19 | 103 |
| Indianapolis | -52.3 | 361 | 746 | 9.7 | 38.74 | 121 |
| Jacksonville | -68.4 | 136 | 529 | 8.8 | 54.47 | 116 |
| Kansas City | -54.5 | 381 | 507 | 10.0 | 37.00 | 99 |
| Little Rock | -61.0 | 91 | 132 | 8.2 | 48.52 | 100 |
| Louisville | -55.6 | 291 | 593 | 8.3 | 43.11 | 123 |
| Memphis | -61.6 | 337 | 624 | 9.2 | 49.10 | 105 |
| Miami | -75.5 | 207 | 335 | 9.0 | 59.80 | 128 |
| Milwaukee | -45.7 | 569 | 717 | 11.8 | 29.07 | 123 |
| Minneapolis | -43.5 | 699 | 744 | 10.6 | 25.94 | 137 |
| Nashville | -59.4 | 275 | 448 | 7.9 | 46.00 | 119 |
| New Orleans | -68.3 | 204 | 361 | 8.4 | 56.77 | 113 |
| Norfolk | -59.3 | 96 | 308 | 10.6 | 44.68 | 116 |
| Omaha | -51.5 | 181 | 347 | 10.9 | 30.18 | 98 |
| Philadelphia | -54.6 | 1692 | 1950 | 9.6 | 39.93 | 115 |
| Phoenix | -70.3 | 213 | 582 | 6.0 | 7.05 | 36 |
| Pittsburgh | -50.4 | 347 | 520 | 9.4 | 36.22 | 147 |
| Providence | -50.0 | 343 | 179 | 10.6 | 42.75 | 125 |
| Richmond | -57.8 | 197 | 299 | 7.6 | 42.59 | 115 |
| Salt Lake City | -51.0 | 137 | 176 | 8.7 | 15.17 | 89 |
| San Francisco | -56.7 | 453 | 716 | 8.7 | 20.66 | 67 |
| Seattle | -51.1 | 379 | 531 | 9.4 | 38.79 | 164 |
| St. Louis | -55.9 | 775 | 622 | 9.5 | 35.89 | 105 |
| Washington | -57.3 | 434 | 757 | 9.3 | 38.89 | 111 |
| Wichita | -56.6 | 125 | 277 | 12.7 | 30.58 | 82 |
| Wilmington | -54.0 | 80 | 80 | 9.0 | 40.25 | 114 |
PCA haciendo la descomposición espectral directamente¶
In [7]:
X = StandardScaler().fit_transform(pollution)
cov_x = np.cov(X.T)
vals, vecs = np.linalg.eig(cov_x)
# la descomposicion hecha con el modulo de algebra lineal de numpy no da los eigenvalores ordenados,
# entonces, los ordenamos de forma descendente
idx = vals.argsort()[::-1]
eigvals = vals[idx]
eigvecs = vecs[:,idx]
Varianza explicada (o desviación estándar)
In [6]:
dat = {'PC':range(1,7),'std':np.sqrt(eigvals), 'var_prop':eigvals/sum(eigvals),
'cum_prop':np.cumsum(eigvals/sum(eigvals))}
stds = pd.DataFrame(data = dat)
stds
Out[6]:
| PC | std | var_prop | cum_prop | |
|---|---|---|---|---|
| 0 | 1 | 1.500356 | 0.366027 | 0.366027 |
| 1 | 2 | 1.239936 | 0.249991 | 0.616018 |
| 2 | 3 | 1.195623 | 0.232442 | 0.848459 |
| 3 | 4 | 0.882742 | 0.126704 | 0.975164 |
| 4 | 5 | 0.342688 | 0.019095 | 0.994259 |
| 5 | 6 | 0.187905 | 0.005741 | 1.000000 |
In [8]:
plt.figure(figsize=(8, 8))
sns.lineplot(x="PC", y="std", data=stds)
plt.title('Screeplot', fontsize=15)
plt.show()
In [9]:
plt.figure(figsize=(8, 8))
sns.lineplot(x="PC", y="cum_prop", data=stds)
plt.title('Varianza acumulada', fontsize=15)
plt.show()
In [10]:
# componentes
comps = pd.DataFrame(data=eigvecs.T, columns=pollution.columns,
index=['pc1','pc2','pc3','pc4','pc5','pc6'])
print(comps.T)
pc1 pc2 pc3 pc4 pc5 pc6 negtemp -0.329646 0.127597 -0.671686 0.306457 -0.558056 -0.136188 manu -0.611542 -0.168058 0.272886 0.136841 -0.102042 0.702971 popul -0.577822 -0.222453 0.350374 0.072481 0.078066 -0.694641 wind -0.353839 0.130792 -0.297253 -0.869426 0.113267 0.024525 precip 0.040807 0.622858 0.504563 -0.171148 -0.568183 -0.060622 predays -0.237916 0.707765 -0.093089 0.311307 0.580004 0.021961
Con sklearn¶
In [8]:
from sklearn.decomposition import PCA
pca = PCA()
# ajustar en los datos (estandarizados)
pca.fit(X)
dat = {'PC':range(1,7),'std':np.sqrt(pca.explained_variance_), 'var_prop':pca.explained_variance_ratio_,
'cum_prop':np.cumsum(pca.explained_variance_ratio_)}
stds = pd.DataFrame(data = dat)
stds
Out[8]:
| PC | std | var_prop | cum_prop | |
|---|---|---|---|---|
| 0 | 1 | 1.500356 | 0.366027 | 0.366027 |
| 1 | 2 | 1.239936 | 0.249991 | 0.616018 |
| 2 | 3 | 1.195623 | 0.232442 | 0.848459 |
| 3 | 4 | 0.882742 | 0.126704 | 0.975164 |
| 4 | 5 | 0.342688 | 0.019095 | 0.994259 |
| 5 | 6 | 0.187905 | 0.005741 | 1.000000 |
Loadings¶
Loadings (vectores propios de $\mathbf{S}$ o $\mathbf{R}$)
- Magnitud
- Signos
- Contrastes
Observa que, en este ejemplo, dos variables se relacionan con “ecologı́a humana” (popul y manu del primer PC) y cuatro relacionadas al clima (temp, wind, precip y predays).
In [9]:
comps = pd.DataFrame(data=pca.components_.T, columns=['pc1','pc2','pc3','pc4','pc5','pc6'],
index=pollution.columns)
print(comps)
pc1 pc2 pc3 pc4 pc5 pc6 negtemp 0.329646 -0.127597 -0.671686 -0.306457 -0.558056 -0.136188 manu 0.611542 0.168058 0.272886 -0.136841 -0.102042 0.702971 popul 0.577822 0.222453 0.350374 -0.072481 0.078066 -0.694641 wind 0.353839 -0.130792 -0.297253 0.869426 0.113267 0.024525 precip -0.040807 -0.622858 0.504563 0.171148 -0.568183 -0.060622 predays 0.237916 -0.707765 -0.093089 -0.311307 0.580004 0.021961
Scores¶
Proyección de los datos en los componentes principales
In [10]:
# proyectar datos
proj = pd.DataFrame(pca.transform(X),columns = ['pc1','pc2','pc3','pc4','pc5','pc6'])
sns.set()
sns.pairplot(proj, height=2);
In [11]:
import plotly.express as px
pca_dataset = pd.DataFrame({'pc1': proj['pc1'], 'pc2': proj['pc2'], 'city': pollution.index})
fig = px.scatter(pca_dataset, x='pc1', y='pc2', hover_data=['city'])
fig.update_layout(
autosize=False,
width=600,
height=600,
)
fig.show()
Biplot¶
In [28]:
# usando el módulo pca (https://pypi.org/project/pca/)
from pca import pca
model = pca(n_components=3)
# Fit transform
results = model.fit_transform(X, row_labels=pollution.index, col_labels=pollution.columns, verbose=False)
model.biplot(figsize=(10, 10), verbose=False)
[scatterd] >INFO> Create scatterplot
Out[28]:
(<Figure size 1000x1000 with 1 Axes>,
<Axes: title={'center': '3 Principal Components explain [84.84%] of the variance'}, xlabel='PC1 (36.6% expl.var)', ylabel='PC2 (24.9% expl.var)'>)
